The dimensions of the train dataset:

## [1] 1460   81

With 1,460 rows and 81 columns, we’re working with quite a bit of data.

## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

We have quite a bit of missing data here. Let’s take a look to see how much.

##        PoolQC   MiscFeature         Alley         Fence   FireplaceQu 
##          1453          1406          1369          1179           690 
##   LotFrontage    GarageType   GarageYrBlt  GarageFinish    GarageQual 
##           259            81            81            81            81 
##    GarageCond  BsmtExposure  BsmtFinType2      BsmtQual      BsmtCond 
##            81            38            38            37            37 
##  BsmtFinType1    MasVnrType    MasVnrArea    Electrical            Id 
##            37             8             8             1             0 
##    MSSubClass      MSZoning       LotArea        Street      LotShape 
##             0             0             0             0             0 
##   LandContour     Utilities     LotConfig     LandSlope  Neighborhood 
##             0             0             0             0             0 
##    Condition1    Condition2      BldgType    HouseStyle   OverallQual 
##             0             0             0             0             0 
##   OverallCond     YearBuilt  YearRemodAdd     RoofStyle      RoofMatl 
##             0             0             0             0             0 
##   Exterior1st   Exterior2nd     ExterQual     ExterCond    Foundation 
##             0             0             0             0             0 
##    BsmtFinSF1    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##             0             0             0             0             0 
##     HeatingQC    CentralAir     X1stFlrSF     X2ndFlrSF  LowQualFinSF 
##             0             0             0             0             0 
##     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath      HalfBath 
##             0             0             0             0             0 
##  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd    Functional 
##             0             0             0             0             0 
##    Fireplaces    GarageCars    GarageArea    PavedDrive    WoodDeckSF 
##             0             0             0             0             0 
##   OpenPorchSF EnclosedPorch    X3SsnPorch   ScreenPorch      PoolArea 
##             0             0             0             0             0 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0
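A per-column count like the one above can be produced with base R alone. The sketch below is a minimal illustration rather than the original code: the `train` data frame and its values are made up, and only a few column names are borrowed from the dataset.

```r
# Toy data frame standing in for the training set (values are made up).
train <- data.frame(
  Id          = 1:5,
  LotFrontage = c(65, NA, 68, NA, 84),
  Alley       = c(NA, NA, NA, "Grvl", NA),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
missing_counts <- sort(colSums(is.na(train)), decreasing = TRUE)
print(missing_counts)

# Keep only the columns that actually have gaps.
missing_counts[missing_counts > 0]
```

Sorting in decreasing order puts the worst-affected features first, which makes it easy to decide which ones need imputation and which could be dropped.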

Ok, so a handful of features (PoolQC, MiscFeature, Alley, and Fence) are missing most of their values, and several others have smaller gaps. Let’s fix that.

What are the most important numerical features, based on correlation?

Let’s take a look at some features that are highly correlated with selling price.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642

The distribution has a long right tail, and there are two outliers with more than 4,500 square feet of living area but a sale price below $200,000. Let’s see how the plot changes without those points.

That looks better, but before we remove those outliers, let’s take a look at the correlation. First with the two datapoints, then without.

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$SalePrice and CleanedTrain$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6821200 0.7332695
## sample estimates:
##       cor 
## 0.7086245
## 
##  Pearson's product-moment correlation
## 
## data:  a$SalePrice and a$GrLivArea
## t = 41.358, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7104365 0.7577160
## sample estimates:
##       cor 
## 0.7349682

Removing the two points raises the correlation from 0.709 to 0.735, which should help improve the results. Let’s remove them.
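The with-and-without comparison can be sketched as follows. This is an illustration under assumptions, not the original code: the `train` data frame is made up, with its last two rows mimicking very large houses that sold cheaply, and `is_outlier` is a hypothetical helper name.

```r
# Toy data (made up): the last two rows mimic very large, cheap houses.
train <- data.frame(
  GrLivArea = c(1200, 1500, 1710, 1900, 2200, 2600, 3000, 4700, 5600),
  SalePrice = c(130000, 155000, 208500, 215000, 250000, 310000, 380000,
                185000, 160000)
)

# Flag rows matching the rule described in the text.
is_outlier <- train$GrLivArea > 4500 & train$SalePrice < 200000

# Pearson correlation with and without the flagged rows.
with_outliers    <- cor.test(train$SalePrice, train$GrLivArea)
without_outliers <- cor.test(train$SalePrice[!is_outlier],
                             train$GrLivArea[!is_outlier])

c(with    = unname(with_outliers$estimate),
  without = unname(without_outliers$estimate))

# Drop the flagged rows once the comparison supports it.
CleanedTrain <- train[!is_outlier, ]
```

Combining both conditions in the filter matters: a large house that sold for a high price is unusual but consistent with the trend, so only the large-and-cheap rows are dropped.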

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$SalePrice and CleanedTrain$OverallQual
## t = 50.141, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7761447 0.8138630
## sample estimates:
##       cor 
## 0.7957743
## 
##   1   2   3   4   5   6   7   8   9  10 
##   2   3  20 116 397 374 319 168  43  16

Everything looks good with OverallQual.

Although OverallCond is not highly correlated with SalePrice, I want to have a closer look, because I thought it would have similar values to OverallQual.

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$SalePrice and CleanedTrain$OverallCond
## t = -2.9834, df = 1456, p-value = 0.002898
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12877065 -0.02671789
## sample estimates:
##         cor 
## -0.07794846
## 
##   1   2   3   4   5   6   7   8   9  10 
##   2   3  20 116 397 374 319 168  43  16

I don’t notice anything worrying/wrong with the data. It looks like the huge range of selling prices with OverallCond of 5 might have ruined any chance of a strong correlation.

## 
##   0   1   2   3 
##   9 650 767  32

This looks good, but now I’m going to focus on continuous variables to see if we can find any more outliers.

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$X1stFlrSF and CleanedTrain$SalePrice
## t = 31.08, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5996337 0.6614237
## sample estimates:
##       cor 
## 0.6315304

Everything looks fine with first floor square footage.

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$TotalBsmtSF and CleanedTrain$SalePrice
## t = 32.738, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6205598 0.6797668
## sample estimates:
##       cor 
## 0.6511529

Everything looks good here.

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$YearBuilt and CleanedTrain$SalePrice
## t = 23.451, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4853166 0.5598955
## sample estimates:
##       cor 
## 0.5236084

It’s interesting to see the housing booms and busts, plus everything looks fine.

There are some big outliers here. Below you’ll see the end result after I experimented with a range of subsets, with upper limits on lot area from 15,000 up to the maximum value; 25,000 seemed to be the optimal limit. I also compared the correlations between SalePrice and other features with, and without, the outliers. Removing the outliers gives equal or better correlations, so we’ll go ahead and remove those datapoints.

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$SalePrice and CleanedTrain$LotArea
## t = 10.622, df = 1456, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2198689 0.3151775
## sample estimates:
##       cor 
## 0.2681793
## 
##  Pearson's product-moment correlation
## 
## data:  a$SalePrice and a$LotArea
## t = 17.522, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3772598 0.4626693
## sample estimates:
##       cor 
## 0.4208969
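One way to run that kind of threshold experiment is to loop over candidate limits and record how many rows survive and what the correlation becomes. The sketch below uses made-up data and a hypothetical set of candidate `limits`; it is not the search that was actually run.

```r
# Toy data (made up): a few very large lots sell for ordinary prices.
train <- data.frame(
  LotArea   = c(6000, 7500, 8450, 9600, 11250, 12000, 14260, 16000,
                19000, 22000, 30000, 50000, 115000),
  SalePrice = c(110000, 125000, 140000, 150000, 172000, 180000, 205000,
                220000, 250000, 280000, 200000, 190000, 240000)
)

# Candidate upper limits for LotArea; Inf means "keep everything".
limits <- c(15000, 20000, 25000, 40000, Inf)

# For each limit, keep rows at or below it and record the correlation.
results <- data.frame(
  limit = limits,
  n     = sapply(limits, function(l) sum(train$LotArea <= l)),
  cor   = sapply(limits, function(l) {
    kept <- train[train$LotArea <= l, ]
    cor(kept$SalePrice, kept$LotArea)
  })
)
print(results)
```

Reporting `n` next to the correlation shows the trade-off directly: a tighter limit may improve the correlation, but at the cost of training rows.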

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$SalePrice and CleanedTrain$LotFrontage
## t = 6.9388, df = 1426, p-value = 5.982e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1300688 0.2304373
## sample estimates:
##       cor 
## 0.1807235

Although this is a significant outlier, my model performs better with this datapoint, so I will leave it in.

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$SalePrice and CleanedTrain$X2ndFlrSF
## t = 12.283, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2616409 0.3554870
## sample estimates:
##       cor 
## 0.3093169

Many houses do not have second floors. The data looks fine.

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$SalePrice and CleanedTrain$WoodDeckSF
## t = 12.041, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2559486 0.3501455
## sample estimates:
##       cor 
## 0.3037892

Many houses also do not have wood decks, but again, the data looks fine.

## 
## Blmngtn Blueste  BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert 
##      17       2      16      58      18     150      50      98      78 
##  IDOTRR MeadowV Mitchel   NAmes NoRidge NPkVill NridgHt  NWAmes OldTown 
##      37      17      45     222      38       9      77      73     113 
##  Sawyer SawyerW Somerst StoneBr   SWISU  Timber Veenker 
##      73      59      86      24      25      33      10

There are definitely some more expensive neighborhoods, such as NoRidge and NridgHt. We’ll use this information for the feature engineering section.
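A neighborhood comparison like this can be turned into a feature by ranking neighborhoods on median sale price. The sketch below uses made-up prices. The feature list later in this post includes a `GoodNeighborhood` column, but the rule used here (median above the overall median) is only an assumption for illustration.

```r
# Toy data (made up prices) for four of the neighborhoods.
train <- data.frame(
  Neighborhood = c("NoRidge", "NoRidge", "NridgHt", "NridgHt",
                   "NAmes", "NAmes", "OldTown", "OldTown"),
  SalePrice    = c(320000, 350000, 310000, 330000,
                   140000, 150000, 115000, 125000),
  stringsAsFactors = FALSE
)

# Median sale price per neighborhood, most expensive first.
med_price <- sort(tapply(train$SalePrice, train$Neighborhood, median),
                  decreasing = TRUE)
print(med_price)

# Hypothetical rule: "good" means a median above the overall median price.
good <- names(med_price)[med_price > median(train$SalePrice)]
train$GoodNeighborhood <- as.integer(train$Neighborhood %in% good)
```

The median is used rather than the mean so that one or two very expensive sales do not decide a neighborhood’s label.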

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$GarageArea and CleanedTrain$SalePrice
## t = 31.304, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6063996 0.6679590
## sample estimates:
##       cor 
## 0.6381983

Hmm, let’s see what happens if we remove garage areas greater than 1,248.

## 
##  Pearson's product-moment correlation
## 
## data:  a$GarageArea and a$SalePrice
## t = 31.983, df = 1424, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6152984 0.6757865
## sample estimates:
##       cor 
## 0.6465575

It’s only a slight improvement, so we’ll keep the datapoints to have more information for building our model.

## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$MonthYearSold and CleanedTrain$SalePrice
## t = -0.7478, df = 1426, p-value = 0.4547
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07159945  0.03210821
## sample estimates:
##         cor 
## -0.01979888

Given that these sales took place during the recession, I was wondering if the selling prices would drop during 2008-2010. They didn’t: the correlation with sale date is close to zero and not significant.

## 
##  20  30  40  45  50  60  70  75  80  85  90 120 160 180 190 
## 521  68   4  12 142 290  60  15  58  20  51  87  63  10  27

It’s a little tough to see any strong insights here. 2-STORY 1946 & NEWER (#60) houses are generally worth the most, but so are 1-STORY PUD (Planned Unit Development, #120) - 1946 & NEWER. Perhaps the number of stories doesn’t matter as much as when the house was made.

## 
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer   SLvl 
##    151     14    707      8     11    435     37     65

Just as a reminder, the correlation between sale price and year built is 0.5487, so I am confident in saying that year built has a stronger correlation with sale price than house style / number of stories.

The earliest value for YearRemodAdd:

## [1] 1950

The number of houses with this minimum value:

## 
## FALSE  TRUE 
##  1251   177
## 
##  Pearson's product-moment correlation
## 
## data:  CleanedTrain$SalePrice and CleanedTrain$YearRemodAdd
## t = 22.922, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4799375 0.5558064
## sample estimates:
##      cor 
## 0.518893

Far too many houses (177 of 1,428) have 1950 as their value for YearRemodAdd. I am going to assume that 1950 is the earliest date this field can hold, and that earlier remodels were recorded as 1950 as a result. For houses with YearRemodAdd = 1950, I am going to replace the value with the year the house was built plus the average difference between the year built and the year remodelled. Here’s an example to clear up any confusion: YearBuilt = 1930, average difference between year built and remodelled = 4.38, new value for YearRemodAdd = 1934.38.
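The replacement rule can be sketched as below. The data is made up, and one detail is an assumption on my part: the average gap is computed only over houses whose YearRemodAdd is not at the floor value, since the text does not say which houses the average was taken over.

```r
# Toy data (made up): three houses sit at the suspicious floor year.
train <- data.frame(
  YearBuilt    = c(1930, 1925, 1960, 1975, 2000, 1940),
  YearRemodAdd = c(1950, 1950, 1965, 1975, 2004, 1950)
)

floor_year <- min(train$YearRemodAdd)
is_floor   <- train$YearRemodAdd == floor_year

# Assumption: the average gap is taken over houses not at the floor value.
avg_gap <- mean(train$YearRemodAdd[!is_floor] - train$YearBuilt[!is_floor])

# Replace floor values with YearBuilt + average gap.
train$YearRemodAdd[is_floor] <- train$YearBuilt[is_floor] + avg_gap
print(train)
```

Note that this rule also rewrites houses that really were remodelled in 1950, because the data gives no way to tell those apart from the floored entries.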

## 
##  Pearson's product-moment correlation
## 
## data:  a$SalePrice and a$YearRemodAdd
## t = 22.023, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4640384 0.5415082
## sample estimates:
##       cor 
## 0.5037856

Although the correlation went down, I believe these new values more accurately represent the real world.

Before we move on to feature engineering, it would definitely be worth having a look at the sale prices.

I’ve been seeing the two data points with a sale price of over $700,000 in many graphs. After looking at the quality of my model, I’ve decided to also remove the datapoints with a sale price greater than $600,000.

Now let’s bring our train and test datasets back together to do some feature engineering.

Let’s take a look at the importance of our features with a simple linear model.

##                           Overall
## MSSubClass           1.680097e-01
## LotFrontage          1.765592e-01
## LotArea              2.044155e+00
## OverallQual          3.796349e+00
## OverallCond          9.622664e-01
## YearBuilt            2.365654e+00
## YearRemodAdd         5.157381e-01
## MasVnrArea           8.220716e-01
## ExterQual            3.269603e+00
## ExterCond            2.007137e+00
## BsmtQual             6.203299e+00
## BsmtCond             3.680787e-01
## BsmtExposure         4.332946e+00
## BsmtFinType1         1.129756e+00
## BsmtFinSF1           5.697591e+00
## BsmtFinType2         5.969373e-01
## BsmtFinSF2           5.964108e+00
## BsmtUnfSF            7.624992e+00
## HeatingQC            5.532932e+00
## X1stFlrSF            1.004104e+00
## X2ndFlrSF            4.677992e-01
## LowQualFinSF         2.828565e+00
## BsmtFullBath         3.127321e+00
## BsmtHalfBath         1.415441e+00
## FullBath             3.764033e+00
## HalfBath             3.601990e+00
## BedroomAbvGr         1.417797e+00
## KitchenAbvGr         5.679278e-01
## KitchenQual          3.795240e-01
## TotRmsAbvGrd         1.614192e+00
## Fireplaces           1.531044e+00
## FireplaceQu          6.175305e+00
## GarageYrBlt          6.329881e-01
## GarageCars           1.894723e+00
## GarageArea           2.612613e+00
## GarageQual           2.278394e-01
## GarageCond           5.494691e-01
## WoodDeckSF           6.012606e+00
## OpenPorchSF          2.213197e+00
## EnclosedPorch        1.923976e+00
## X3SsnPorch           5.245668e-01
## ScreenPorch          2.692368e-02
## PoolArea             1.868264e-01
## PoolQC               8.272682e-02
## Fence                8.065488e+00
## MiscVal              2.381341e-01
## MoSold               1.760673e+00
## YrSold               5.659477e-01
## percentHouse         8.353356e-01
## percentHouseGarage   1.882891e+00
## GoodNeighborhood     3.026729e+00
## BadNeighborhood      1.088519e+00
## OverallQualxCond     3.587357e+00
## BsmtFinPercentSF     9.106924e-01
## TotalBathsperSF      1.939343e+00
## FireplacesxQual      6.274650e-01
## KitchensxQual        7.206375e-01
## SFxQual              1.071867e+01
## LQPErcentSF          2.051748e+00
## GarageSFperCar       7.919911e-01
## Flr2percentofFlr1    2.748062e+00
## percentFl1           1.830893e+00
## AvgAboveGRoomSize    1.286782e+00
## LotAreaoverFrontage  6.817430e-01
## BsmtSFxQual          4.507261e+00
## Has2Stories          7.860663e-01
## HasBasement          1.887314e+00
## HasPool              8.653526e-04
## SimpleOverallQual    1.959272e+00
## SimpleOverallCond    5.029794e-01
## SimpleLotArea        1.411329e-02
## SimpleYearBuilt      3.663488e-01
## SimpleYearRemodAdd   4.217705e-02
## SimpleTotalInsideSF  2.706979e+00
## SeasonSold           1.488035e+00
## `MSZoningC (all)`    4.630056e+00
## MSZoningFV           2.788425e+00
## MSZoningRH           1.019120e+00
## MSZoningRL           8.297192e-01
## StreetGrvl           6.797203e-01
## AlleyGrvl            7.603106e-01
## AlleyNo              5.312885e-01
## LotShapeIR1          1.909638e+00
## LotShapeIR2          1.009988e+00
## LotShapeIR3          3.135676e-01
## LandContourBnk       2.775576e-01
## LandContourHLS       1.319667e+00
## LandContourLow       2.085593e+00
## UtilitiesAllPub      1.045859e+00
## LotConfigCorner      8.225090e-01
## LotConfigCulDSac     2.462361e+00
## LotConfigFR2         2.010955e+00
## LotConfigFR3         2.034813e+00
## LandSlopeGtl         9.761301e-01
## LandSlopeMod         1.259337e+00
## NeighborhoodBlmngtn  4.113385e-01
## NeighborhoodBlueste  2.162810e+00
## NeighborhoodBrDale   6.844212e-01
## NeighborhoodBrkSide  4.584528e-03
## NeighborhoodClearCr  6.376745e-01
## NeighborhoodCollgCr  1.926425e+00
## NeighborhoodCrawfor  1.780462e+00
## NeighborhoodEdwards  2.105532e+00
## NeighborhoodGilbert  1.562743e+00
## NeighborhoodIDOTRR   4.269220e-01
## NeighborhoodMeadowV  3.256500e-01
## NeighborhoodMitchel  2.248399e+00
## NeighborhoodNAmes    1.862019e+00
## NeighborhoodNoRidge  4.409449e+00
## NeighborhoodNPkVill  7.848236e-01
## NeighborhoodNridgHt  4.143219e+00
## NeighborhoodNWAmes   1.980629e+00
## NeighborhoodOldTown  3.306485e-01
## NeighborhoodSawyer   1.332622e+00
## NeighborhoodSawyerW  4.621692e-01
## NeighborhoodSomerst  1.348079e+00
## NeighborhoodTimber   1.705572e+00
## Condition1Artery     4.888831e-01
## Condition1Feedr      2.698038e-02
## Condition1Norm       5.758299e-01
## Condition1PosA       4.669808e-03
## Condition1PosN       2.087548e-01
## Condition1RRAe       1.440549e+00
## Condition1RRAn       1.198545e-01
## Condition1RRNe       4.780986e-01
## Condition2Artery     3.774422e-02
## Condition2Feedr      1.787274e-02
## Condition2Norm       1.502499e-01
## Condition2PosA       4.051186e-01
## Condition2PosN       1.605917e+00
## Condition2RRAe       1.592848e+00
## Condition2RRAn       3.473550e-01
## BldgType1Fam         6.687288e-01
## BldgType2fmCon       8.676784e-01
## BldgTypeDuplex       2.675956e-01
## BldgTypeTwnhs        8.217463e-02
## HouseStyle1.5Fin     3.518496e-01
## HouseStyle1.5Unf     2.578729e-01
## HouseStyle1Story     1.397168e-01
## HouseStyle2.5Fin     1.741301e+00
## HouseStyle2.5Unf     2.337347e-01
## HouseStyle2Story     1.731853e-01
## HouseStyleSFoyer     8.237646e-01
## RoofStyleFlat        1.876056e+00
## RoofStyleGable       2.079221e+00
## RoofStyleGambrel     2.091246e+00
## RoofStyleHip         2.073792e+00
## RoofStyleMansard     1.972305e+00
## RoofMatlCompShg      3.758649e-01
## RoofMatlMetal        6.824457e-01
## RoofMatlRoll         4.406007e-01
## `RoofMatlTar&Grv`    5.884704e-02
## RoofMatlWdShake      6.132312e-01
## Exterior1stAsbShng   2.787260e-01
## Exterior1stAsphShn   9.579359e-01
## Exterior1stBrkComm   9.894096e-01
## Exterior1stBrkFace   2.249790e+00
## Exterior1stCBlock    8.438300e-02
## Exterior1stCemntBd   2.049460e+00
## Exterior1stHdBoard   9.637580e-01
## Exterior1stImStucc   1.281198e+00
## Exterior1stMetalSd   8.328004e-02
## Exterior1stPlywood   9.337231e-01
## Exterior1stStone     5.297039e-02
## Exterior1stStucco    1.624742e-01
## Exterior1stVinylSd   3.605513e-01
## `Exterior1stWd Sdng` 1.701192e+00
## Exterior2ndAsbShng   1.953916e-01
## Exterior2ndAsphShn   3.135820e-01
## `Exterior2ndBrk Cmn` 1.109379e+00
## Exterior2ndBrkFace   5.916532e-01
## Exterior2ndCmentBd   2.197466e+00
## Exterior2ndHdBoard   3.462801e-01
## Exterior2ndImStucc   6.415420e-01
## Exterior2ndMetalSd   1.900926e-01
## Exterior2ndOther     1.097401e+00
## Exterior2ndPlywood   2.255570e-01
## Exterior2ndStone     2.137113e-01
## Exterior2ndStucco    6.758661e-01
## Exterior2ndVinylSd   5.723081e-01
## `Exterior2ndWd Sdng` 1.898363e+00
## MasVnrTypeBrkCmn     2.078606e+00
## MasVnrTypeBrkFace    2.975114e+00
## MasVnrTypeNone       3.286975e+00
## FoundationBrkTil     1.728084e+00
## FoundationCBlock     2.064199e+00
## FoundationPConc      2.391359e+00
## FoundationSlab       1.195395e+00
## FoundationStone      2.505999e+00
## HeatingFloor         6.078845e-01
## HeatingGasA          4.822663e-02
## HeatingGasW          1.024229e-02
## HeatingGrav          4.079993e-01
## HeatingOthW          3.205415e-01
## CentralAirN          1.676277e+00
## ElectricalFuseA      2.728565e-02
## ElectricalFuseF      5.820782e-01
## ElectricalFuseP      3.269400e-01
## ElectricalMix        4.067518e-01
## FunctionalMaj1       3.954265e+00
## FunctionalMaj2       2.474029e+00
## FunctionalMin1       2.132648e+00
## FunctionalMin2       1.840222e+00
## FunctionalMod        3.115766e+00
## FunctionalSev        2.005316e+00
## GarageType2Types     1.135673e+00
## GarageTypeAttchd     1.060783e+00
## GarageTypeBasment    7.545961e-01
## GarageTypeBuiltIn    4.058432e-01
## GarageTypeCarPort    1.060592e+00
## GarageTypeDetchd     8.745200e-01
## GarageFinishFin      2.629584e-01
## GarageFinishRFn      3.166345e-01
## PavedDriveN          6.959825e-01
## PavedDriveP          1.489969e+00
## MiscFeatureGar2      7.209261e-01
## MiscFeatureNone      2.035848e+00
## MiscFeatureOthr      1.788292e+00
## MiscFeatureShed      2.064403e+00
## SaleTypeCOD          2.494546e-01
## SaleTypeCon          1.754641e+00
## SaleTypeConLD        2.645492e+00
## SaleTypeConLI        1.421714e-01
## SaleTypeConLw        6.404067e-01
## SaleTypeCWD          1.372885e+00
## SaleTypeNew          4.358512e-01
## SaleTypeOth          4.022196e-01
## SaleConditionAbnorml 1.186427e+00
## SaleConditionAdjLand 3.230346e-01
## SaleConditionAlloca  6.086550e-01
## SaleConditionFamily  9.826569e-01
## SaleConditionNormal  5.678404e-01
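The single `Overall` column resembles what caret’s `varImp()` prints for a linear model, where the importance of each term is the absolute value of its t-statistic. Assuming that is what was used here, the same numbers can be reproduced in base R. The sketch below uses the built-in `mtcars` data as a stand-in for the housing data.

```r
# Built-in mtcars stands in for the housing data.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Absolute t-statistic of each coefficient, with the intercept dropped.
t_values   <- summary(fit)$coefficients[-1, "t value"]
importance <- sort(abs(t_values), decreasing = TRUE)
print(importance)
```

Because these are t-statistics, a large value means the coefficient is estimated precisely relative to its size. With many correlated predictors, as in the table above, individual values can shift a lot depending on which other features are in the model.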

Now it’s time to train the model.

## + Fold1: lambda=5e-06, penalty=MCP 
## - Fold1: lambda=5e-06, penalty=MCP 
## + Fold2: lambda=5e-06, penalty=MCP 
## - Fold2: lambda=5e-06, penalty=MCP 
## + Fold3: lambda=5e-06, penalty=MCP 
## - Fold3: lambda=5e-06, penalty=MCP 
## Aggregating results
## Fitting final model on full training set
## + Fold1: lambda=0.002 
## - Fold1: lambda=0.002 
## + Fold2: lambda=0.002 
## - Fold2: lambda=0.002 
## + Fold3: lambda=0.002 
## - Fold3: lambda=0.002 
## Aggregating results
## Fitting final model on full training set
## + Fold1: C=0.6 
## - Fold1: C=0.6 
## + Fold2: C=0.6 
## - Fold2: C=0.6 
## + Fold3: C=0.6 
## - Fold3: C=0.6 
## Aggregating results
## Fitting final model on full training set

These algorithms were chosen after spot checks of their initial performance; their parameters were then tuned.
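The fold-by-fold log above suggests 3-fold cross-validation. As a rough illustration of what each resample measures, here is a base-R sketch that runs 3-fold cross-validation by hand. It uses `lm()` on the built-in `mtcars` data rather than the models and data from this post, and the helper name `rmse` is my own.

```r
set.seed(42)  # reproducible fold assignment

k     <- 3
# Assign each row to one of k folds, in shuffled order.
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Train on k - 1 folds, score on the held-out fold, repeat for each fold.
cv_rmse <- sapply(seq_len(k), function(i) {
  train_rows <- folds != i
  fit  <- lm(mpg ~ wt + hp, data = mtcars[train_rows, ])
  pred <- predict(fit, newdata = mtcars[!train_rows, ])
  rmse(mtcars$mpg[!train_rows], pred)
})

cv_rmse        # one RMSE per held-out fold
mean(cv_rmse)  # average across folds
```

With only three folds, the spread between the per-fold values is a coarse estimate of variability, so small differences between models should be read with caution.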

## 
## Call:
## summary.resamples(object = results)
## 
## Models: rqnc, rqlasso, svmLinear 
## Number of resamples: 3 
## 
## RMSE 
##            Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
## rqnc      20390   20650  20920 20840   21060 21210    0
## rqlasso   20500   20720  20940 20900   21100 21250    0
## svmLinear 20610   20920  21230 21290   21630 22020    0
## 
## Rsquared 
##             Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## rqnc      0.9097  0.9132 0.9168 0.9182  0.9225 0.9282    0
## rqlasso   0.9127  0.9143 0.9159 0.9183  0.9212 0.9264    0
## svmLinear 0.9090  0.9120 0.9150 0.9147  0.9176 0.9202    0

##                rqnc   rqlasso svmLinear
## rqnc      1.0000000 0.7176324 0.5843642
## rqlasso   0.7176324 1.0000000 0.9844993
## svmLinear 0.5843642 0.9844993 1.0000000

Ensemble the models together.

##   parameter     RMSE  Rsquared   RMSESD  RsquaredSD
## 1      none 19378.56 0.9263387 1179.276 0.006171796
## The following models were ensembled: rqnc, rqlasso, svmLinear 
## They were weighted: 
## -1237.7545 0.602 0.1165 0.2919
## The resulting RMSE is: 19378.5561
## The fit for each individual model on the RMSE is: 
##     method     RMSE   RMSESD
##       rqnc 20838.52 415.6135
##    rqlasso 20896.68 375.5777
##  svmLinear 21287.45 710.3928
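To make the printed weights concrete, the sketch below applies them as a linear blend. It assumes the first number is an intercept and the remaining three follow the order in which the models are listed; the base-model predictions are made up.

```r
# Weights copied from the printed ensemble summary.
# Assumption: the first value is an intercept, the rest follow model order.
intercept <- -1237.7545
weights   <- c(rqnc = 0.602, rqlasso = 0.1165, svmLinear = 0.2919)

# Made-up base-model predictions for three houses.
preds <- data.frame(
  rqnc      = c(120000, 165000, 190000),
  rqlasso   = c(121000, 166000, 193000),
  svmLinear = c(119000, 164000, 195000)
)

# Linear blend: intercept + sum over models of weight * prediction.
ensemble <- drop(intercept + as.matrix(preds[, names(weights)]) %*% weights)
print(ensemble)
```

The three weights sum to slightly more than 1, which fits a regression fitted on the base-model predictions better than a plain weighted average.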

Summary of input algorithms, then ensembled model:

##       rqnc           rqlasso         svmLinear     
##  Min.   : 47311   Min.   : 41490   Min.   : 35167  
##  1st Qu.:128048   1st Qu.:128154   1st Qu.:129428  
##  Median :159858   Median :159562   Median :160276  
##  Mean   :179214   Mean   :179113   Mean   :179877  
##  3rd Qu.:211402   3rd Qu.:209913   3rd Qu.:212640  
##  Max.   :878680   Max.   :882979   Max.   :849266
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   58730  127100  160700  179200  210000  479100

RMSE score:

## [1] 0.1198852
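This score is several orders of magnitude smaller than the dollar-scale RMSE values reported earlier, which would be consistent with an error computed on the logarithm of the prices. Assuming that is the case, a log-scale RMSE can be sketched as follows; the function name `rmse_log` and the example prices are mine.

```r
# RMSE between the logs of actual and predicted prices.
# Assumes all prices are strictly positive.
rmse_log <- function(actual, predicted) {
  sqrt(mean((log(actual) - log(predicted))^2))
}

# Made-up example: two predictions are off by a few percent, one is exact.
actual    <- c(208500, 181500, 223500)
predicted <- c(200000, 190000, 223500)
rmse_log(actual, predicted)
```

On the log scale, a 10% miss on a cheap house costs about as much as a 10% miss on an expensive one, so high-priced houses do not dominate the score.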

The first ten predicted sale prices in the dataset to be submitted to the competition:

##        Id SalePrice
## 1461 1461  120589.7
## 1462 1462  165167.5
## 1463 1463  192201.0
## 1464 1464  203259.7
## 1465 1465  196737.3
## 1466 1466  172754.8
## 1467 1467  183232.5
## 1468 1468  163851.5
## 1469 1469  193996.5
## 1470 1470  124049.0